Cache-based Statistical Language Models of English and Highly Inflected Lithuanian
نویسندگان
چکیده
منابع مشابه
Cache-based Statistical Language Models of English and Highly Inflected Lithuanian
This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. The impact of the cache size, type of the decay function, including custom corpus derived functions, and interpolation technique (static vs. dynamic) on the perplexity of a language model is studied. The best results are achieved by models consis...
متن کاملHighly-Inflected Language Generation Using Factored Language Models
Statistical language models based on n-gram counts have been shown to successfully replace grammar rules in standard 2-stage (or ‘generate-and-select’) Natural Language Generation (NLG). In highlyinflected languages, however, the amount of training data required to cope with n-gram sparseness may be simply unobtainable, and the benefits of a statistical approach become less obvious. In this wor...
متن کاملString Distance-Based Stemming of the Highly Inflected Croatian Language
Stemming refers to the grouping of morphologically related words into so-called stem classes for the purpose of improving information retrieval performance. Traditional approaches to stemming are language-specific and require a substantial amount of linguistic knowledge. A viable alternative is string distance-based stemming, in which stem classes are obtained by clustering word-forms from a co...
متن کاملLSA-based language model adaptation for highly inflected languages
This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, subword units are used as basic units for language modeling. Since such units carry little semantic information, they are not very suitable for topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt language mod...
متن کاملStatistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition
This paper describes our research on statistical language modeling of Lithuanian. The idea of improving sparse n-gram models of highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition was investigated. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated c...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Informatica
سال: 2006
ISSN: 0868-4952,1822-8844
DOI: 10.15388/informatica.2006.127